NLG with narrator

Denis Abdullin

WHOAMI

  • Data Science in a corporate environment
  • 7 years as individual contributor and manager
  • Johnson & Johnson >> Merck
  • Interests:
    • NLG
    • Web Applications
    • Predictive Analytics
    • Time Series

Package

narrator is a template-based NLG system that produces written narratives from a data set.

install.packages("devtools")
devtools::install_github("denisabd/narrator")

library(narrator)

NLG Systems

There are several approaches to creating text from data. The two most widely used are:

  • Template-based NLG system
  • ML-based NLG system

Templates in R

Many R packages support different types of templates, but glue is part of the tidyverse and very easy to use. A simple template can look like this:

Temperatures on {date} will reach the max of {max_temp} at {max_hour}.

When adding the actual variable values we get the text:

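template <- "Temperatures on {date} will reach the max of {max_temp} at {max_hour}."
date <- "June 27th"
max_temp <- "28C"
max_hour <- "3PM"

# glue() fills each {placeholder} from the variables defined above
glue::glue(template)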
Temperatures on June 27th will reach the max of 28C at 3PM.
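
To make the mechanics concrete, here is a minimal base R sketch of what a template engine does: it swaps each {placeholder} for a value. The fill_template() helper is hypothetical and written only for illustration. It is not how glue or narrator are implemented, and unlike glue it does not evaluate R expressions inside the braces.

```r
# Hypothetical helper: replace each {name} placeholder with a supplied value.
fill_template <- function(template, values) {
  for (name in names(values)) {
    placeholder <- paste0("{", name, "}")
    # fixed = TRUE treats the placeholder as literal text, not a regex
    template <- gsub(placeholder, values[[name]], template, fixed = TRUE)
  }
  template
}

template <- "Temperatures on {date} will reach the max of {max_temp} at {max_hour}."
forecast <- list(date = "June 27th", max_temp = "28C", max_hour = "3PM")

result <- fill_template(template, forecast)
print(result)
```

Keeping the wording in the template and the numbers in the data is what makes the output predictable: the same template always yields the same sentence structure.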

Simple Example

sales %>%
narrate_descriptive(measure = "Sales",
            dimensions = "Product")
$`Total Sales`
Total Sales across all Products is 38790478.4.

$`Product by Sales`
Outlying Products by Sales are Food & Beverage (15543469.7, 40.1 %), Electronics (8608962.8, 22.2 %).

Converting to HTML

sales %>%
narrate_descriptive(measure = "Sales",
            dimensions = "Product") %>%
  to_html()

Total Sales

Total Sales across all Products is 38790478.4.

Product by Sales

Outlying Products by Sales are Food & Beverage (15543469.7, 40.1 %), Electronics (8608962.8, 22.2 %).

To use narratives in reports and applications, convert them with to_html(). You can also format the numbers by setting format_numbers = TRUE in any narrate function.

Arguments

sales %>%
narrate_descriptive(measure = "Sales",
            dimensions = "Product", 
            format_numbers = TRUE) %>%
  to_html()

Total Sales

Total Sales across all Products is 38.79 M.

Product by Sales

Outlying Products by Sales are Food & Beverage (15.5 M, 40.1 %), Electronics (8.6 M, 22.2 %).
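
The kind of shortening that format_numbers = TRUE produces can be sketched in a few lines of base R. The format_compact() helper below is hypothetical and only an approximation for illustration; narrator's own formatting rules (for example, how many digits it keeps) may differ.

```r
# Hypothetical helper: shorten a number using K / M / B suffixes.
format_compact <- function(x, digits = 1) {
  abs_x <- abs(x)
  if (abs_x >= 1e9) {
    paste(round(x / 1e9, digits), "B")
  } else if (abs_x >= 1e6) {
    paste(round(x / 1e6, digits), "M")
  } else if (abs_x >= 1e3) {
    paste(round(x / 1e3, digits), "K")
  } else {
    as.character(round(x, digits))
  }
}

total <- format_compact(38790478.4, digits = 2)  # total Sales from the example
top_product <- format_compact(15543469.7)        # Food & Beverage Sales
print(c(total, top_product))
```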

Arguments

sales %>%
narrate_descriptive(measure = "Sales",
            dimensions = "Product", 
            format_numbers = TRUE,
            coverage = 0.8) %>%
  to_html()

Total Sales

Total Sales across all Products is 38.79 M.

Product by Sales

Outlying Products by Sales are Food & Beverage (15.5 M, 40.1 %), Electronics (8.6 M, 22.2 %), Home (4.6 M, 11.9 %), Tools (4.4 M, 11.4 %).

Arguments

sales %>%
narrate_descriptive(measure = "Sales",
            dimensions = "Product", 
            format_numbers = TRUE,
            coverage = 0.8,
            coverage_limit = 3) %>%
  to_html()

Total Sales

Total Sales across all Products is 38.79 M.

Product by Sales

Outlying Products by Sales are Food & Beverage (15.5 M, 40.1 %), Electronics (8.6 M, 22.2 %), Home (4.6 M, 11.9 %).
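
The three Arguments slides suggest how the two settings interact: items are taken from largest to smallest until their cumulative share reaches coverage, and coverage_limit caps how many are listed. The sketch below encodes that reading in base R. It is an assumption inferred from the outputs above, not narrator's actual code; select_by_coverage() is a hypothetical helper, its defaults are assumed, and the values are made up.

```r
# Hypothetical helper: pick the largest items until their cumulative share
# reaches `coverage`, but never return more than `coverage_limit` items.
# Assumes 0 < coverage <= 1.
select_by_coverage <- function(values, coverage = 0.5, coverage_limit = 5) {
  sorted <- sort(values, decreasing = TRUE)
  cumulative_share <- cumsum(sorted) / sum(sorted)
  n_needed <- which(cumulative_share >= coverage)[1]
  sorted[seq_len(min(n_needed, coverage_limit))]
}

# Made-up values, not the sales data set
toy <- c(A = 40, B = 22, C = 12, D = 11, E = 8, F = 7)

default_pick <- names(select_by_coverage(toy))
wider_pick   <- names(select_by_coverage(toy, coverage = 0.8))
capped_pick  <- names(select_by_coverage(toy, coverage = 0.8, coverage_limit = 3))

print(default_pick)
print(wider_pick)
print(capped_pick)
```

With these toy values the default lists two items, coverage = 0.8 lists four, and adding coverage_limit = 3 trims the list to three, which mirrors the pattern in the narratives above.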

Summarization

narrator works with both aggregated and non-aggregated data. To keep the narratives correct, choose the right summarization option: the default is sum, but you can also use average or count.

sales %>%
  narrate_descriptive(
    measure = "Order ID", 
    dimensions = "Product",
    summarization = "count"
  )
$`Total Order ID`
Total Order ID across all Products is 10000.

$`Product by Order ID`
Outlying Products by Order ID are Food & Beverage (3552, 35.5 %), Electronics (1975, 19.8 %).
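
Why the option matters is easiest to see on a tiny example. The base R sketch below aggregates made-up order lines three ways; the data frame and column names are illustrative assumptions and have nothing to do with narrator internals. Summing an identifier such as Order ID would be meaningless, which is why the example above uses count.

```r
# Made-up order lines, not the sales data set
orders <- data.frame(
  order_id = 1:6,
  product  = c("Food", "Food", "Food", "Tools", "Tools", "Home"),
  sales    = c(10, 20, 30, 5, 15, 40)
)

by_sum   <- tapply(orders$sales, orders$product, sum)        # total sales per product
by_mean  <- tapply(orders$sales, orders$product, mean)       # average sale per product
by_count <- tapply(orders$order_id, orders$product, length)  # number of orders per product

print(by_sum)
print(by_mean)
print(by_count)
```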

Templates

For a template-based system, it is very useful to be able to change individual templates and make the output more flexible.

$`Total Order ID`
Order ID Volume across all Products is equal to 10000.

$`Product by Order ID`
Outlying Products by Order ID are Food & Beverage (3552, 35.5 %), Electronics (1975, 19.8 %).
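
One plausible way to get output like the above is to pass a custom template string to the narrate function. The sketch below assumes that narrate_descriptive() accepts a template_total argument, matching the template name shown by list_templates(); treat that argument name, and the exact template wording, as assumptions and check the package documentation before relying on them. The call runs only if narrator is installed.

```r
# Assumed custom template; the placeholders are copied from the default
# template_total shown by list_templates()
custom_total <- "{measure} Volume across all {pluralize(dimension_one)} is equal to {total}."

# `template_total` as an argument name is an assumption, not confirmed here
if (requireNamespace("narrator", quietly = TRUE)) {
  result <- narrator::narrate_descriptive(
    narrator::sales,
    measure = "Order ID",
    dimensions = "Product",
    summarization = "count",
    template_total = custom_total
  )
  print(result)
}
```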

Variables

The variables available for narrative generation can be accessed with the return_data = TRUE argument in all narrate functions.

List Templates

To see all available templates at once, use the list_templates() function.

fun                  name                          template
narrate_descriptive  template_total                Total {measure} across all {pluralize(dimension_one)} is {total}.
narrate_descriptive  template_average              Average {measure} across all {pluralize(dimension_one)} is {total}.
narrate_descriptive  template_outlier              Outlying {dimension} by {measure} is {outlier_insight}.
narrate_descriptive  template_outlier_multiple     Outlying {pluralize(dimension)} by {measure} are {outlier_insight}.
narrate_descriptive  template_outlier_l2           In {level_l1}, significant {level_l2} by {measure} is {outlier_insight}.
narrate_descriptive  template_outlier_l2_multiple  In {level_l1}, significant {pluralize(level_l2)} by {measure} are {outlier_insight}.

Trend Narratives

A great way to instantly generate insights about how metrics develop over time is to create so-called trend narratives with the narrate_trend() function. Let’s create a dataset with dates:

data <- sales %>%
 dplyr::mutate(Date = lubridate::floor_date(Date, unit = "month")) %>%
 dplyr::group_by(Region, Product, Date) %>%
 dplyr::summarise(Sales = sum(Sales, na.rm = TRUE))

data %>%
  dplyr::ungroup() %>%
  dplyr::slice(1:8) %>%
  reactable::reactable(bordered = TRUE, striped = TRUE)

Trend Narratives

The basic trend narrative analyzes the data year-over-year. narrator requires a date or datetime column to create these narratives.

Year-over-Year

narrate_trend(data,
              type = "yoy") %>%
  to_html()

2021 YTD vs 2020 YTD

From 2020 YTD to 2021 YTD, Sales had an increase of 1.13 M (9.1 %, 12.42 M to 13.55 M).

Sales change by Region

Regions with biggest changes of Sales are NA (533.1 K, 9.1 %, 5.9 M to 6.4 M), EMEA (416.9 K, 9.91 %, 4.2 M to 4.6 M).

NA by Product

In NA, significant Products by Sales change are Food & Beverage (243.3 K, 9.92 %, 2.5 M to 2.7 M), Tools (186.8 K, 31.87 %, 585.9 K to 772.7 K).

EMEA by Product

In EMEA, significant Products by Sales change are Electronics (312.1 K, 35.88 %, 869.7 K to 1.2 M), Food & Beverage (238.2 K, 14.54 %, 1.6 M to 1.9 M).

Sales change by Product

Products with biggest changes of Sales are Food & Beverage (535.4 K, 10.63 %, 5 M to 5.6 M), Electronics (525.9 K, 19.79 %, 2.7 M to 3.2 M).

Food & Beverage by Month

In Food & Beverage, significant Months by Sales change are Oct (-141.6 K, -23.39 %, 605.4 K to 463.8 K), Sep (132.7 K, 37.27 %, 356.2 K to 489 K), Dec (118.3 K, 16.67 %, 709.5 K to 827.8 K), May (99 K, 28.12 %, 352 K to 451 K).

Electronics by Month

In Electronics, significant Months by Sales change are Nov (170.7 K, 70.62 %, 241.7 K to 412.4 K), Dec (108.3 K, 36.23 %, 298.8 K to 407.1 K), May (-74.1 K, -26.73 %, 277.3 K to 203.2 K), Feb (70.6 K, 38.24 %, 184.6 K to 255.3 K).

Sales change by Month

Months with biggest changes of Sales are Nov (386.5 K, 29.17 %, 1.3 M to 1.7 M), Apr (226.6 K, 24.4 %, 928.6 K to 1.2 M), Jan (162.2 K, 23.06 %, 703.4 K to 865.6 K).
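
The figures in each year-over-year sentence follow ordinary change arithmetic: an absolute change plus a change relative to the earlier period. The base R sketch below reproduces the headline sentence from the two rounded totals reported above (12.42 M and 13.55 M). It illustrates the arithmetic only and handles just the increase case; the variable names are illustrative, and narrator works from the unrounded data.

```r
# Totals in millions, as reported in the year-over-year narrative
previous <- 12.42
current  <- 13.55

abs_change <- current - previous           # absolute change
pct_change <- abs_change / previous * 100  # change relative to the earlier period

summary_text <- sprintf(
  "Sales had an increase of %.2f M (%.1f %%, %.2f M to %.2f M).",
  abs_change, pct_change, previous, current
)
print(summary_text)
```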

Previous Period

narrate_trend(data, 
              type = "previous period") %>%
  to_html()

Dec 2021 vs Nov 2021

From Nov 2021 to Dec 2021, Sales had an increase of 176.32 K (10.3 %, 1.71 M to 1.89 M).

Sales change by Region

Regions with biggest changes of Sales are EMEA (154.2 K, 26.54 %, 580.8 K to 735 K), NA (-111 K, -13.14 %, 844.4 K to 733.4 K).

EMEA by Product

In EMEA, significant Products by Sales change are Food & Beverage (97.1 K, 45.41 %, 213.8 K to 310.9 K), Tools (60.1 K, 76.53 %, 78.6 K to 138.7 K).

NA by Product

In NA, significant Products by Sales change are Tools (-87.2 K, -55.54 %, 157 K to 69.8 K), Home (-51.4 K, -44.95 %, 114.3 K to 62.9 K).

Sales change by Product

Product with biggest changes of Sales is Food & Beverage (197.2 K, 31.28 %, 630.6 K to 827.8 K).

Same Period Last Year

narrate_trend(data, 
              type = 3) %>%
  to_html()

Dec 2021 vs Dec 2020

From Dec 2020 to Dec 2021, Sales had an increase of 92.89 K (5.2 %, 1.79 M to 1.89 M).

Sales change by Region

Regions with biggest changes of Sales are EMEA (184.3 K, 33.48 %, 550.6 K to 735 K), NA (-144.3 K, -16.44 %, 877.8 K to 733.4 K).

EMEA by Product

In EMEA, significant Products by Sales change are Food & Beverage (104.2 K, 50.43 %, 206.7 K to 310.9 K), Tools (78.5 K, 130.46 %, 60.2 K to 138.7 K).

NA by Product

In NA, significant Products by Sales change are Clothing (-71.6 K, -76.95 %, 93.1 K to 21.5 K), Home (-44.3 K, -41.31 %, 107.2 K to 62.9 K).

Sales change by Product

Products with biggest changes of Sales are Food & Beverage (118.3 K, 16.67 %, 709.5 K to 827.8 K), Electronics (108.3 K, 36.23 %, 298.8 K to 407.1 K), Baby (-79.3 K, -33.91 %, 233.9 K to 154.6 K).
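
The last two trend types differ only in which earlier period the latest month is compared with. A small base R sketch of that date arithmetic follows, using the latest month from the outputs above (Dec 2021). The variable names are illustrative assumptions, and this is not narrator's internal logic.

```r
latest <- as.Date("2021-12-01")  # latest month in the example output

# seq() with a negative `by` steps backwards in calendar units
previous_period       <- seq(latest, by = "-1 month", length.out = 2)[2]
same_period_last_year <- seq(latest, by = "-1 year", length.out = 2)[2]

print(previous_period)
print(same_period_last_year)
```

The first date corresponds to the "Dec 2021 vs Nov 2021" comparison and the second to "Dec 2021 vs Dec 2020".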

ChatGPT

narrator can use the ChatGPT API to improve your narratives. To do so, either set use_chatgpt = TRUE in any function that creates a narrative, or use enhance_narrative() to improve existing narrative output. You can supply a list or a character vector; the function collapses all the text into one piece and sends a request to ChatGPT. Set your token in the .Renviron file as OPENAI_API_KEY, or pass it to the function as the openai_api_key argument.

This functionality requires you to set up a ChatGPT API key and make it accessible from R.
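
Two practical details from this slide can be sketched in base R without calling the API: collapsing a narrative list into one string, and checking that the key is visible to the R session. The narrative list reuses two sentences from earlier in this deck; the variable names and the message text are illustrative assumptions, not part of narrator.

```r
# Example narrative list, using two sentences from earlier in this deck
narrative <- list(
  `Total Sales` = "Total Sales across all Products is 38790478.4.",
  `Product by Sales` = "Outlying Products by Sales are Food & Beverage (15543469.7, 40.1 %), Electronics (8608962.8, 22.2 %)."
)

# Collapse the list into a single string, as a request body would need
prompt_text <- paste(unlist(narrative), collapse = " ")

# Sys.getenv() returns an empty string when the variable is not set
api_key <- Sys.getenv("OPENAI_API_KEY")
if (!nzchar(api_key)) {
  message("OPENAI_API_KEY is not set; add it to .Renviron and restart R.")
}

print(prompt_text)
```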

ChatGPT

narrative <- sales %>%
  narrate_descriptive(
    measure = "Sales",
    dimensions = c("Region", "Product"),
    use_chatgpt = TRUE
  )

cat(narrative)

The Total Sales of our company amount to $38,790,478.4, contributing to our success across all regions. However, our Outlying Regions stand out, with impressive Sales figures of $18,079,736.4, constituting 46.6% of the total Sales, followed by EMEA, with Sales of $13,555,412.7, comprising 34.9%. In our Outlying Region, Food & Beverage and Electronics have emerged as noteworthy Products, contributing $7,392,821 (40.9%) and $3,789,132.7 (21%) respectively, towards the impressive Sales figures. Similarly, in EMEA, Food & Beverage and Electronics have emerged as significant Products, contributing $5,265,113.2 (38.8%) and $3,182,803.4 (23.5%) respectively. Lastly, Food & Beverage and Electronics have been the Outlying Products driving our Sales, with a total contribution of $15,543,469.7 (40.1%) and $8,608,962.8 (22.2%) respectively.

Translation

Translate your text with the translate_narrative() function. Specify the language argument in English:

translation <- translate_narrative(narrative, language = "Czech")
cat(translation)

Celkové tržby naší společnosti činí 38 790 478,4 dolarů a přispívají k našemu úspěchu ve všech oblastech. Významně se však vynořují výsledky pro Naše Okrajové oblasti, s impozantním prodejem ve výši 18 079 736,4 dolarů, což představuje 46,6 % z celkových prodejů. Následuje oblast EMEA s prodejem 13 555 412,7 dolarů, což znamená 34,9 %. V Našich Okrajových oblastech vynikly produkty Výživa a Nápoje a Elektronika, které přispěly impozantními prodejními výsledky ve výši 7 392 821 dolarů (40,9 %) a 3 789 132,7 dolarů (21 %) z celkových prodejů. Podobně ve v oblasti EMEA, Výživa a Nápoje a Elektronika vynikly jako významné produkty, které přispěly 5 265 113,2 dolarů (38,8 %) a 3 182 803,4 dolarů (23,5 %) k celkovým prodejům. Nakonec je Výživa a Nápoje a Elektronika produkty, které vedou prodeje v Našich Okrajových oblastech, a to s celkovým přínosem 15 543 469,7 dolarů (40,1 %) a 8 608 962,8 dolarů (22,2 %) z celkových prodejů.

Summarization

If your output is too verbose, you can summarize it with the summarize_narrative() function:

summarization <- summarize_narrative(narrative)
cat(summarization)

Our company’s total sales are $38,790,478.4, with our Outlying Regions contributing the most at 46.6%. These regions mainly sell Food & Beverage and Electronics, which together make up 40.9% of sales. Similarly, EMEA sells these products the most, accounting for 38.8% of sales. Overall, Food & Beverage and Electronics are the products driving our sales.

Resources